In [1]:
import pandas as pd
import numpy as np
import networkx as nx

There's a whole bunch of hubway data available over at the Hubway Data Challenge website.

There's already been a bunch of visualizations submitted, especially since the submission date was way back in 2012 (I think).

I decided I'd try my hand, especially since sigma.js seems like such a useful visualization tool.

I also wanted to see if a community detection algorithm on the graph would discover anything interesting (or otherwise obvious) about the communiting patterns of hubway users.

The pipeline looks like this:

  1. Read dataset from the hubway site into pandas
  2. Create a networkx directed graph representation
  3. Run community detection
  4. Export it to .gexf
  5. Play around with the graph until something nice appears
  6. Export THAT graph to .gexf, then play around with sigma.js until something else nice appears.

Okay, here we go:

First, import all the station and trip data from hubway.

Here's what they look like after pulling them into pandas:


In [2]:
stations = pd.read_csv('../data/stations_10_12_to_11_13.csv', 
                       index_col=0)

trips = pd.read_csv('../data/hubwaydata_10_12_to_11_13.csv',
                    index_col=0,
                    parse_dates=['start_date', 'end_date'])


/Users/dan/attic/tools/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py:1070: DtypeWarning: Columns (9) have mixed types. Specify dtype option on import or set low_memory=False.
  data = self._reader.read(nrows)

In [125]:
# sigma.js requres the node IDs to be strings.
stations = stations.set_index(stations.index.map(str))
trips.start_station = trips.start_station.map(str)
trips.end_station = trips.end_station.map(str)

In [128]:
stations.head(2)


Out[128]:
terminal station municipality lat lng nb_docks install_date last_day
3 B32006 Colleges of the Fenway Boston 42.340021 -71.100812 15 4/8/2013 11/27/2013
4 C32000 Tremont St. at Berkeley St. Boston 42.345392 -71.069616 15 4/8/2013 11/27/2013

2 rows × 8 columns


In [127]:
trips.head(2)


Out[127]:
status duration start_date start_station end_date end_station bike_nr subscription_type zip_code birth_date gender
id
708191 Closed 240 2012-10-31 23:57:00 53 2012-11-01 00:01:00 67 B00412 Registered 02139 NaN Male
708190 Closed 840 2012-10-31 23:54:00 36 2012-11-01 00:08:00 115 B00491 Registered 02127 NaN Female

2 rows × 11 columns


Create the graph

Next, we've gotta create the graph.

To do this, I iterate through all the rows of the stations dataframe, adding nodes to the graph as I go.

I weight an edge between two stations $n_1$ and $n_2$ by the number of trips taken from $n_1$ to $n_2$, $t_{1,2}$.

(now that I'm writing this, I should probably normalize these weights by the total number of trips leaving a station in general. I'll update the notebook later.)


In [129]:
DG = nx.DiGraph()

# add station nodes
# sigma.js requires a string ID, an X, and a Y for each node.
for n, r in stations.iterrows():
    DG.add_node(n,
                label=r.station,
                y=r.lat,
                x=r.lng)
    
# add edges for trips.
# There's an edge between two stations if there was a trip between them.
# Edges are weighted by the number of trips
for n, r in trips.iterrows():
    if DG.has_edge(r.start_station, r.end_station):
        DG[r.start_station][r.end_station]['weight'] += 1
    else:
        DG.add_edge(r.start_station, r.end_station, weight=1)

Here's what the gross networkx visualization looks like (with the Fruchterman-Reingold layout):


In [140]:
pos = nx.fruchterman_reingold_layout(DG)
nx.draw_networkx(DG, pos)



Community detection

There's a nice little python module called communities that implements the Louvain community detection method.

I'm pretty ignorant to best practices here, so I just ran it a few times until I saw a layout in the induced community graph (image #2).


In [131]:
import community

DG = nx.Graph(DG)
partition = community.best_partition(DG)

size = float(len(set(partition.values())))
pos = nx.spring_layout(DG)
count = 0
for com in set(partition.values()) :
    count = count + 1.
    list_nodes = [nodes for nodes in partition.keys()
                                if partition[nodes] == com]
    nx.draw_networkx_nodes(DG, pos, list_nodes, node_size = 20,
                                node_color = str(count / size))


nx.draw_networkx_edges(DG,pos, alpha=0.8)
plt.show()


# let's see if it makes sense:
ind_graph = community.induced_graph(partition, DG)

#pos = nx.spring_layout(ind_graph)
pos = nx.fruchterman_reingold_layout(ind_graph)

nx.draw_networkx(ind_graph, pos)


Great. Now that we've run the community detection, we can then add a 'community' attribute to each node, and then export it to .gexf for use in gephi.


In [132]:
# add the communities to the graph
for n, d in DG.nodes_iter(data=True):
    DG.node[n]['community'] = partition[n]

In [ ]:
nx.write_gexf(DG, '../sigma/trips.json')

Messing around with the network is beyond the scope of this post/notebook.

In Gephi, I did the following things:

  1. Color-coded the communities
  2. removed edges with a weight less than 50, representing routes that had fewer than fifty total trips in the dataset.
  3. applied a geo-layout to keep the geographic data of the stations intact.

You can see the finished product produced with sigma.js here